1 Introduction

The intent of this file is to sintetize EDA approaches learned during Udacity’s Nanodegree in R language. For the analysis, a database from a Chicago bike sharing company will be used.

2 Preparação

First, let’s download all libraries that will be used, as well as the bike sharing database.

2.1 Libraries

# Data visualization
#install.packages('ggplot2')
library('ggplot2')
#install.packages("knitr")
#install.packages('plotly')
library('plotly') 
#install.packages('gridExtra')
library('gridExtra')

# Dat manipulation
#install.packages('dplyr')
library('dplyr')
#install.packages("tidyr")
library('tidyr')

2.2 Database Import

filename <- "chicago.txt"

df <- read.csv(filename,sep=",")
class(df)
## [1] "data.frame"

3 EDA

3.1 Summary

Let’s check how many rows and columns our dataset have, as well as check the summary of all the variables:

dim(df)
## [1] 1551505       8
summary(df)
##                Start.Time                     End.Time      
##  2017-04-15 13:27:10:      7   2017-05-21 13:07:49:      8  
##  2017-05-29 15:19:36:      7   2017-06-10 13:31:26:      8  
##  2017-04-15 13:02:35:      6   2017-02-18 14:24:00:      7  
##  2017-04-26 16:07:17:      6   2017-06-02 17:16:46:      7  
##  2017-05-15 17:14:43:      6   2017-06-03 14:32:22:      7  
##  2017-05-16 17:56:13:      6   2017-06-10 18:47:17:      7  
##  (Other)            :1551467   (Other)            :1551461  
##  Trip.Duration                          Start.Station    
##  Min.   :   60.0   Streeter Dr & Grand Ave     :  36686  
##  1st Qu.:  392.0   Clinton St & Washington Blvd:  22429  
##  Median :  670.0   Lake Shore Dr & Monroe St   :  21822  
##  Mean   :  939.8   Clinton St & Madison St     :  19098  
##  3rd Qu.: 1127.0   Canal St & Adams St         :  18105  
##  Max.   :86338.0   Lake Shore Dr & North Blvd  :  17729  
##                    (Other)                     :1415636  
##                        End.Station           User.Type      
##  Streeter Dr & Grand Ave     :  39537   Customer  : 317162  
##  Clinton St & Washington Blvd:  21113   Dependent :      4  
##  Clinton St & Madison St     :  20969   Subscriber:1234339  
##  Lake Shore Dr & Monroe St   :  20652                       
##  Lake Shore Dr & North Blvd  :  19734                       
##  Theater on the Lake         :  18529                       
##  (Other)                     :1410971                       
##     Gender         Birth.Year    
##        :316867   Min.   :1899    
##  Female:298784   1st Qu.:1975    
##  Male  :935854   Median :1984    
##                  Mean   :1981    
##                  3rd Qu.:1989    
##                  Max.   :2016    
##                  NA's   :316683

Our dataframe has around 1,5M rows and 8 columns. The variables goes as following:

  • Start.Time = timestamp when the user starts a ride
  • End.Time = timestamp when the user finishes a ride
  • Trip.Duration = for how long the user rides the rented bike (in seconds)
  • Start.Station = station where the user got the bike
  • End.Station = station where the user left the bike
  • User.Type = can be customer, dependent or subscriber
  • Gender
  • Birth.Year

Let’s take a look at the first 20 rows of the dataset:

head(df,20)
##             Start.Time            End.Time Trip.Duration
## 1  2017-01-01 00:00:36 2017-01-01 00:06:32           356
## 2  2017-01-01 00:02:54 2017-01-01 00:08:21           327
## 3  2017-01-01 00:06:06 2017-01-01 00:18:31           745
## 4  2017-01-01 00:07:28 2017-01-01 00:12:51           323
## 5  2017-01-01 00:07:57 2017-01-01 00:20:53           776
## 6  2017-01-01 00:10:44 2017-01-01 00:21:27           643
## 7  2017-01-01 00:11:34 2017-01-01 00:23:47           733
## 8  2017-01-01 00:14:57 2017-01-01 00:26:22           685
## 9  2017-01-01 00:15:03 2017-01-01 00:26:28           685
## 10 2017-01-01 00:17:01 2017-01-01 00:29:49           768
## 11 2017-01-01 00:17:13 2017-01-01 11:03:34         38781
## 12 2017-01-01 00:18:28 2017-01-01 00:31:05           757
## 13 2017-01-01 00:18:50 2017-01-01 00:21:47           177
## 14 2017-01-01 00:23:41 2017-01-01 00:29:13           332
## 15 2017-01-01 00:25:47 2017-01-01 00:39:53           846
## 16 2017-01-01 00:25:47 2017-01-01 00:43:23          1056
## 17 2017-01-01 00:26:21 2017-01-01 00:39:40           799
## 18 2017-01-01 00:27:21 2017-01-01 00:42:59           938
## 19 2017-01-01 00:27:28 2017-01-01 00:42:44           916
## 20 2017-01-01 00:27:45 2017-01-01 00:31:13           208
##                           Start.Station                   End.Station
## 1                  Canal St & Taylor St      Canal St & Monroe St (*)
## 2            Larrabee St & Menomonee St  Sheffield Ave & Kingsbury St
## 3  Orleans St & Chestnut St (NEXT Apts)    Ashland Ave & Blackhawk St
## 4               Franklin St & Monroe St        Clinton St & Tilden St
## 5                  Broadway & Barry Ave       Sedgwick St & North Ave
## 6                  State St & Kinzie St            Wells St & Polk St
## 7                Wabash Ave & Wacker Pl        Clinton St & Tilden St
## 8                    Daley Center Plaza      Canal St & Monroe St (*)
## 9                    Daley Center Plaza      Canal St & Monroe St (*)
## 10                Dayton St & North Ave       Ogden Ave & Chicago Ave
## 11           Wilton Ave & Diversey Pkwy   Halsted St & Wrightwood Ave
## 12                Canal St & Madison St      LaSalle St & Illinois St
## 13                  Theater on the Lake Lakeview Ave & Fullerton Pkwy
## 14              Halsted St & Maxwell St          Halsted St & 18th St
## 15        Ravenswood Ave & Lawrence Ave    Clarendon Ave & Gordon Ter
## 16             Clark St & Congress Pkwy         Wolcott Ave & Polk St
## 17        Ravenswood Ave & Lawrence Ave    Clarendon Ave & Gordon Ter
## 18                      Millennium Park        Michigan Ave & 18th St
## 19                      Millennium Park        Michigan Ave & 18th St
## 20              Damen Ave & Chicago Ave       Damen Ave & Division St
##     User.Type Gender Birth.Year
## 1    Customer                NA
## 2  Subscriber   Male       1984
## 3  Subscriber   Male       1985
## 4  Subscriber   Male       1990
## 5  Subscriber   Male       1990
## 6  Subscriber   Male       1970
## 7  Subscriber   Male       1986
## 8    Customer                NA
## 9    Customer                NA
## 10   Customer                NA
## 11 Subscriber Female       1988
## 12   Customer                NA
## 13 Subscriber   Male       1991
## 14 Subscriber   Male       1984
## 15 Subscriber Female       1987
## 16 Subscriber   Male       1984
## 17 Subscriber   Male       1987
## 18 Subscriber   Male       1991
## 19 Subscriber Female       1990
## 20 Subscriber   Male       1982

3.2 Start Time

How are the rides distributed along each day of the week? Is there a difference between weekdays and weekends?

#find which weekday each date represent
df$day <- weekdays(as.Date(df$Start.Time)) 

#group by weekday and summarise the Trip Duration by counting the occurrences
grouped_per_weekday <- df %>%
              group_by(day) %>%
              summarise(count_trip_duration = round(length(Trip.Duration),0))

#barplot - count of trips per weekday
ggplot(grouped_per_weekday, 
       aes(x = factor(day,c('Sunday','Monday','Tuesday','Wednesday',
                            'Thursday','Friday','Saturday')), 
           y = count_trip_duration)) +
        geom_col() +
        geom_text(aes(label=count_trip_duration), 
                    position=position_dodge(width=0.9), 
                    vjust=-0.25) +
        ggtitle("Count of Trips per Weekday") +
        xlab("Number of Trips") +
        ylab("Weekday")

#group by weekday and summarise the Trip Duration by the mean of trip duration
grouped_per_weekday <- df %>%
              group_by(day) %>%
              summarise(mean_trip_duration = round(mean(Trip.Duration),0))

#barplot - mean of trip duration per weekday
ggplot(grouped_per_weekday, 
       aes(x = factor(day,c('Sunday','Monday','Tuesday','Wednesday',
                            'Thursday','Friday','Saturday')), 
           y = mean_trip_duration)) +
       geom_col() +
       geom_text(aes(label=mean_trip_duration), 
                   position=position_dodge(width=0.9), 
                   vjust=-0.25) +
       ggtitle("Average Trip Duration (in seconds) per Weekday") +
       xlab("Average Trip Duration (in seconds)") +
       ylab("Weekday")

Although there are slightly fewer bike trips during the weekend, the trip duration are considerably higher when compared to week days.

3.3 Trip duration

#histogram = frequency of each trip duration (in seconds)
ggplot(df, aes(Trip.Duration)) +
        geom_histogram() +
        ggtitle("Histogram for Trip Duration (in seconds)") +
        xlab("Frequency") +
        ylab("Trip Duration (in seconds)")

Since the trip duration data is over-dispersed, let’s apply a logarithm and a square root transformation on it.

#histogram = frequency of each trip duration (in seconds)
p1 <- ggplot(aes(x = Trip.Duration), data = df) +
      geom_histogram() +
      ggtitle("Histogram for Trip Duration (in seconds)") +
      xlab("Frequency") +
      ylab("Trip Duration")
#transformed histogram based on previous one using Log10
p2 <- p1 + scale_x_log10() +
         ggtitle("Histogram for Trip Duration with Log10 normalization (in seconds)")
#transformed histogram based on the first one using sqrt
p3 <- p1 + scale_x_sqrt() +
         ggtitle("Histogram for Trip Duration with Sqrt normalization (in seconds)")

grid.arrange(p1,p2,p3,ncol=1)

3.4 Gender

Let’s check how many male riders we have: 935854

What about female riders? 298784

Let’s check how each Gender compare considering their average trip duration:

by(df$Trip.Duration,df$Gender,summary)
## df$Gender: 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      60     927    1379    1875    1962   86338 
## -------------------------------------------------------- 
## df$Gender: Female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    60.0   398.8   648.0   783.7  1010.0 85782.0 
## -------------------------------------------------------- 
## df$Gender: Male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    60.0   338.0   541.0   673.1   861.0 86096.0

Although there’s significantly more male riders than female ones, the female have longer trip durations on average. We can check how they compare using a frequency polygon:

#plot -> frequency polygon of trip duration per gender
ggplot(aes(x = Trip.Duration), 
       data = subset(df, Gender %in% c('Male','Female'))) +
      geom_freqpoly(aes(color = Gender)) +
      scale_x_log10() +
      ggtitle("Frequency Polygon Comparing Gender based on Trip Duration (in seconds)") +
      xlab("Frequency")

We can see that the female normal distribution is slightly more left-skewed than the male curve, validating what we saw earlier.

3.5 Birth year

#histogram -> birth year
ggplot(df, aes(Birth.Year)) + geom_histogram() + 
      ggtitle("Birth Year Histogram") +     
      xlab("Frequency") +
      ylab("Birth Year")

Most users birth year falls between 1950 and 2000, with a peak around 1980-1990.

Distribution of types of user:

#group by user type
grouped_users <- df %>%
              group_by(User.Type) %>%
              summarise(count_trips = length(Trip.Duration))

#bar plot -> count of trips per user type
ggplot(grouped_users, aes(User.Type,count_trips)) +
      geom_col() +
      geom_text(aes(label=count_trips), 
                  position=position_dodge(width=0.9), 
                  vjust=-0.25) +
      ggtitle("Count of each User Type") +
      xlab("Count of Trips") +
      ylab("User Type")

We can see that most of the riders are actual subscribers.

Using the plotly package, we can creat the trip duration per birth year of users, also showing the start and end station. Since the chart is heavy for the amout of data, we will use a 10k sample of the data.

#set seed since a sample will be used
set.seed(1)
#make a sample of 10K
sample_df <- df[sample(nrow(df), 10000), ]
#scatter-plot -> trip duration distribution per birth year 
plot_ly(data = sample_df, 
        x = sample_df$Trip.Duration , 
        y = sample_df$Birth.Year,
        marker = list(size = 10,
                       color = 'rgba(255, 182, 193, .9)',
                       line = list(color = 'rgba(152, 0, 0, .8)',width = 2)),
        text = ~paste("Start Station: ", 
                      sample_df$Start.Station, 
                      '$<br>Finish Station:',
                      sample_df$End.Station)
        ) %>%
  layout(title = 'Trip duration distribution (in seconds) per Birth Year',
         yaxis = list(zeroline = FALSE, title = "Birth Year"),
         xaxis = list(zeroline = FALSE,title = 'Trip Duration (in seconds)'))

We can see that the majority of trips stay under the 10k seconds threshold and user age doesn’t really seem to have an impact on it. The ride with most duration departed from Clinton st and lasted for more than 74K seconds (or more than 20 hours!). It’s probably a user who forgot to return the bike =D

Since the Female riders seems to have longer trips on average on age hasn’t prove to have much different on trip duration, let’s try grouping our data by age and gender and check the average trip duration for the group:

#group by Birth Year and gender, summarising trip duration using mean
grouped_df <- df %>%
              group_by(Birth.Year,Gender) %>%
              summarise(mean_trip_duration = mean(Trip.Duration)) %>%
              ungroup() %>%
              arrange(Birth.Year)

#line plot: mean of trip duration distributed by birthyear e per gender
ggplot(aes(x = Birth.Year, y = mean_trip_duration),
       data = subset(grouped_df, Gender %in% c('Male','Female'))) +
      geom_line(aes(color = Gender), stat = 'summary', fun.y = median) +
      ggtitle("Mean Trip Duration per Gender and Distributed by Birth Year") +
      xlab("Birth Year") +
      ylab("Mean of Trip Duration (in seconds)")

For the birthday range where we have most of the datapoints (between 1950 and 2000), apparently Female riders consistently have a higher trip duration than Male riders. We can better evaluate by plotting the actual data with the mean summary statistics overlaid:

# scatterplot - birth year and trip duration, by gender (with mean overlaid)
ggplot(aes(x = Birth.Year, y = Trip.Duration),
       data = subset(df, Gender %in% c('Male','Female'))) +
      geom_point(aes(color = Gender)) +
      geom_line(aes(color = Gender), stat = 'summary', 
                 fun.y = mean, size = 1) +
      geom_point(aes(fill = Gender), stat = 'summary', fun.y = mean, 
                   shape = 23, color = "black", size = 2) +
      ggtitle("Mean Trip Duration per Gender and Distributed by Birth Year") +
      xlab("Birth Year") +
      ylab("Mean of Trip Duration (in seconds)")

The peaks in the original graph are due to years with very few observations (and so may not be considered representative), but the range between 1950 and 2000 is still relevant on data volume.

Let’s check the same analysis, but for customers x subscribers:

#group by Birth Year and user type, summarising trip duration using mean
grouped_df <- df %>%
              group_by(Birth.Year,User.Type) %>%
              summarise(mean_trip_duration = mean(Trip.Duration)) %>%
              ungroup() %>%
              arrange(Birth.Year)
#line plot: mean of trip duration distributed by birthyear e per user type
ggplot(aes(x = Birth.Year, y = mean_trip_duration),
       data = subset(grouped_df, User.Type %in% c('Customer','Subscriber'))) +
    geom_line(aes(color = User.Type), stat = 'summary', fun.y = median) +
    ggtitle("Mean Trip Duration per User Type and Distributed by Birth Year") +
    xlab("Birth Year") +
    ylab("Mean of Trip Duration (in seconds)")

Customers, at first, seem to have longer trip than subscribers. Let’s plot the actual data again to make sure it’s representative:

# scatterplot - birth year and trip duration, by user type (with mean overlaid)
ggplot(aes(x = Birth.Year, y = Trip.Duration),
       data = subset(df, User.Type %in% c('Customer','Subscriber'))) +
    geom_point(aes(color = User.Type)) +
    geom_line(aes(color = User.Type), stat = 'summary', fun.y = mean, 
                size = 1) +
    geom_point(aes(fill = User.Type), stat = 'summary', fun.y = mean, 
                 shape = 23, color = "black", size = 2) +
    ggtitle("Mean Trip Duration per User Type and Distributed by Birth Year") +
    xlab("Birth Year") +
    ylab("Mean of Trip Duration (in seconds)")

Those higher means were based on few datapoints and, therefore, not really representative.

3.6 Stations

Finally, let’s check which are the 5 most popular stations on both starting and finishing rides:

#group by start station and summarise by count of trips
grouped_start <- df %>%
                 group_by(Start.Station) %>%
                 summarise (no_rows = length(Start.Station))

#find the cumulative count of trips
grouped_start <- grouped_start [order (-grouped_start$no_rows),] %>%
      mutate (Pareto = cumsum(grouped_start$no_rows/sum(grouped_start$no_rows)))

#get the top 5                 
grouped_start <- head(grouped_start,5)

#group by end station and summarise by count of trips
grouped_end <- df %>%
                 group_by(End.Station) %>%
                 summarise (no_rows = length(End.Station))

#find the cumulative count of trips
grouped_end <- grouped_end [order (-grouped_end$no_rows),] %>%
          mutate (Pareto = cumsum(grouped_end$no_rows/sum(grouped_end$no_rows)))
               
#get the top 5                 
grouped_end <- head(grouped_end,5)

#increase margin for y labels
par(mar=c(4,12,4,4))
# bar plot - most used start stations
barplot(grouped_start$no_rows,
        horiz=TRUE,
    density=NA,
    xlab="Cummulative Counts",
    axes=TRUE, names.arg=grouped_start$Start.Station, cex.names=0.5, las=1,
    main = "5 Most Popular Start Stations")
title(ylab="Stations", mgp=c(11,1,0), cex.lab=1.2)

#increase margin for y labels
par(mar=c(4,12,4,4))
# bar plot - most used end stations
barplot(grouped_end$no_rows,
        horiz=TRUE,
    density=NA,
    xlab="Cummulative Counts",
    axes=TRUE, names.arg=grouped_end$End.Station, cex.names=0.5, las=1,
    main = "5 Most Popular End Stations")
title(ylab="Stations", mgp=c(11,1,0), cex.lab=1.2)

Interesting! The 2 most popular start stations are also the 2 most popular end stations. Also, both Clinton St & Madison St and Lake Shore Dr & Monroe St appear in the top 5, although in different positions.

Let’s check how many (if any) trips start and finish in the same station by adding a new column to our dataset:

df <- df %>%
      mutate (Same.Station = identical(df$End.Station,df$Start.Station))
sum(df$Same.Station)
## [1] 0

There are not trips starting and ending at the same station.

4 Final Plots

I’ve chosen the charts that compare trip duration vs age for both gender and user type since they are the most insigthful charts as described later in the reflection. Also the trip duration distribution using plotly shows how concentrated the trips are and how age have a low influence in this specific variable.

#group by Birth Year and gender, summarising trip duration using mean
grouped_df <- df %>%
              group_by(Birth.Year,Gender) %>%
              summarise(mean_trip_duration = mean(Trip.Duration)) %>%
              ungroup() %>%
              arrange(Birth.Year)

#line plot: mean of trip duration distributed by birthyear e per gender
ggplot(aes(x = Birth.Year, y = mean_trip_duration),
       data = subset(grouped_df, Gender %in% c('Male','Female'))) +
      geom_line(aes(color = Gender), stat = 'summary', fun.y = median) +
      ggtitle("Mean Trip Duration per Gender and Distributed by Birth Year") +
      xlab("Birth Year") +
      ylab("Mean of Trip Duration (in seconds)")

For the birthday range where we have most of the datapoints (between 1950 and 2000), apparently Female riders consistently have a higher trip duration than Male riders.

grouped_per_weekday <- df %>%
              group_by(day) %>%
              summarise(mean_trip_duration = round(mean(Trip.Duration),0))

#barplot - mean of trip duration per weekday
ggplot(grouped_per_weekday, 
       aes(x = factor(day,c('Sunday','Monday','Tuesday','Wednesday',
                            'Thursday','Friday','Saturday')), 
           y = mean_trip_duration)) +
       geom_col() +
       geom_text(aes(label=mean_trip_duration), 
                   position=position_dodge(width=0.9), 
                   vjust=-0.25) +
       ggtitle("Average Trip Duration (in seconds) per Weekday") +
       xlab("Average Trip Duration (in seconds)") +
       ylab("Weekday")

The trip duration during weekends is considerably higher when compared to week days.

#set seed since a sample will be used
set.seed(1)
#make a sample of 10K
sample_df <- df[sample(nrow(df), 10000), ]
#scatter-plot -> trip duration distribution per birth year 
plot_ly(data = sample_df, 
        x = sample_df$Trip.Duration , 
        y = sample_df$Birth.Year,
        marker = list(size = 10,
                       color = 'rgba(255, 182, 193, .9)',
                       line = list(color = 'rgba(152, 0, 0, .8)',width = 2)),
        text = ~paste("Start Station: ", 
                      sample_df$Start.Station, 
                      '$<br>Finish Station:',
                      sample_df$End.Station)
        ) %>%
  layout(title = 'Trip duration distribution (in seconds) per Birth Year',
         yaxis = list(zeroline = FALSE, title = "Birth Year"),
         xaxis = list(zeroline = FALSE,title = 'Trip Duration (in seconds)'))

We can see that the majority of trips stay under the 10k seconds threshold and user age doesn’t really seem to have an impact on it. The ride with most duration departed from Clinton st and lasted for more than 74K seconds (or more than 20 hours!). It’s probably a user who forgot to return the bike =D

5 Reflection

Based on the analysis of the bike sharing database, we can evaluate a number of strategic information such as most engaged users’ profile and most popular stations.

First, regarding which day of the week has the most number of bike rides, it is obviously the weekend.

We can see that the customer base of this bike sharing company is unbalanced in terms of gender: a lot more Male riders than Female. However, Female riders have a longer trip duration on average, which probably also translates in higher NPV. Also, there’s no big difference in trip duration when looking at user age.

Regarding the stations, we can see that 2 stations are the most popular for both starting and ending trips: Streeter Dr & Grand Ave and Clinton St & Washington Blvd. They must be located in very important spots of Chicago.

This exercise has been quite interesting since I’m used to working with Python and it was nice to see that R has so many functionalities and packages. It also presents itself as a strong language for data analysis and modeling. The sintax was a little weird at first (specially when compared to traditional programming languages such as Java), but it was easy later to get the hang of it. I’d love to reach to Kaggle projects and see what kind of beautiful analysis and data viz can be made using it.